Bound Ratis reconfiguration retries and add region migration ITs#17895
Conversation
When scaling in a DataNode, if a peer that is being ADDED to a Ratis schema/data region (during region migration) is killed before it catches up, the leader's reconfiguration can never commit. The Ratis client used for reconfiguration retried forever (retryForeverWithSleep), so the coordinator DataNode's AddRegionPeerTask blocked indefinitely inside setConfiguration, leaving the migration permanently stuck -- and CANCEL ineffective, since the task never left PROCESSING. Bound the reconfiguration retries instead of retrying forever: after the limit is exhausted the last failure propagates, so the upper layer can fail and roll back the migration (and CANCEL becomes reachable again). Expose the limit as a per-group config, pushed from ConfigNode to DataNode via TRatisConfig (optional fields, for rolling-upgrade safety): - config_node_ratis_reconfiguration_max_retry_attempts - schema_region_ratis_reconfiguration_max_retry_attempts - data_region_ratis_reconfiguration_max_retry_attempts Default 600 attempts at the fixed 2s interval (~20min cap). Also rename the now-misnamed EndlessRetryFactory / RatisEndlessRetryPolicy to Reconfiguration*, since the policy is no longer endless.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #17895 +/- ##
============================================
+ Coverage 40.68% 40.74% +0.06%
- Complexity 2620 2621 +1
============================================
Files 5244 5244
Lines 362374 362696 +322
Branches 46653 46687 +34
============================================
+ Hits 147419 147793 +374
+ Misses 214955 214903 -52 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| /** | ||
| * RatisConsensus protocol, max retry attempts for a configuration change (add/remove peer). Uses | ||
| * a fixed 2s retry interval; bounding the attempts stops a killed ADDING peer from blocking the | ||
| * reconfiguration -- and hence a region migration -- forever. | ||
| */ | ||
| private int configNodeRatisReconfigurationMaxRetryAttempts = 600; | ||
|
|
||
| private int dataRegionRatisReconfigurationMaxRetryAttempts = 600; | ||
| private int schemaRegionRatisReconfigurationMaxRetryAttempts = 600; | ||
|
|
There was a problem hiding this comment.
Will 20 minutes be too long? May use a similar interval, referring to how CN determines whether a node is unknown.
There was a problem hiding this comment.
Agreed. 20 minutes is too long for this failure path. I changed the default reconfiguration retry attempts from 600 to 15, which is about 30 seconds with the fixed 2s retry interval. This keeps it close to the ConfigNode Unknown detection window while leaving a small buffer for transient delays. The newly added Ratis region migration ITs have also been moved into DailyIT after the CI passed.
|



Problem
When a DataNode that hosts a newly ADDING peer is killed during a Ratis region migration, the Ratis configuration change can stay in the staging state. The previous reconfiguration client retried these failures forever, so the DataNode add-peer task could remain PROCESSING indefinitely and the ConfigNode migration procedure could not fail and roll back.
Changes
reconfigurationMaxRetryAttemptsclient setting. Reconfiguration requests still use a fixed 2s retry interval, but now stop after the configured attempt count and propagate the last failure to the caller.ReconfigurationRetryFactoryandRatisReconfigurationRetryPolicy.config_node_ratis_reconfiguration_max_retry_attemptsschema_region_ratis_reconfiguration_max_retry_attemptsdata_region_ratis_reconfiguration_max_retry_attempts15, which is roughly 30 seconds with the fixed 2s interval. This keeps the timeout close to the default ConfigNode Unknown detection window while still leaving a small buffer for transient delays.TRatisConfigfields. The fields are optional so rolling upgrades can fall back to local DataNode defaults when an old ConfigNode does not send them.Ratis IT Coverage
The branch adds Ratis region migration ITs to match the IoTV1 coverage pattern and cover Ratis-specific consensus-pipe states:
CREATE_NEW_REGION_PEERCREATE_CONSENSUS_PIPESDO_ADD_REGION_PEERUPDATE_REGION_LOCATION_CACHETRANSFER_REGION_LEADERREMOVE_REGION_PEERDELETE_OLD_REGION_PEERDROP_CONSENSUS_PIPESREMOVE_REGION_LOCATION_CACHECREATE_CONSENSUS_PIPESandDROP_CONSENSUS_PIPES.The Ratis IT setup uses reconfiguration retry count
10to keep failure-path tests bounded. The new Ratis IT classes are now tagged withDailyITafter passing CI.Verification
mvn spotless:apply -pl integration-test -P with-integration-testsmvn test-compile -pl integration-test -P with-integration-tests -DskipTestsmvn verify -DskipUTs -Dit.test=IoTDBRegionMigrateAddingPeerCrashForRatisIT,IoTDBRegionMigrateConfigNodeCrashForRatisIT,IoTDBRegionMigrateClusterCrashForRatisIT -DintegrationTest.includedGroups= -DintegrationTest.excludedGroups= -DfailIfNoTests=false -Dfailsafe.failIfNoSpecifiedTests=false -pl integration-test -P with-integration-testsmvn spotless:check -pl integration-test -P with-integration-testsgit diff --check